Fix: WAN 2.2 I2V/T2V training issues + Feature requests for VRAM optimization #474
Summary
This PR includes critical bug fixes for WAN 2.2 I2V/T2V training and improvements for video training workflows.
Fixes Included
1. MoE Per-Expert LR Logging Fix ✨ NEW
Problem: The logged LR was averaged across all param groups for MoE models, making it impossible to verify per-expert LR adaptation and state preservation.
Solution:
- Log each optimizer param group's learning rate separately (in `jobs/process/BaseSDTrainProcess.py`) instead of a single averaged value; see the sketch below

Example output: `lr0: 5.0e-04 lr1: 3.5e-05`

Files changed:
- `jobs/process/BaseSDTrainProcess.py`
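A minimal sketch of the idea (the helper name and call site are illustrative, not the exact code in `jobs/process/BaseSDTrainProcess.py`):

```python
# Sketch: report one LR per optimizer param group so each MoE expert's
# adaptive learning rate stays visible, instead of one averaged number.
def format_group_lrs(optimizer) -> str:
    parts = []
    for idx, group in enumerate(optimizer.param_groups):
        parts.append(f"lr{idx}: {group['lr']:.1e}")
    return " ".join(parts)

# With two param groups (one per expert), this yields a log/progress-bar
# string like "lr0: 5.0e-04 lr1: 3.5e-05".
```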
2. MoE Transformer Detection Bug Fix ✨ NEW
Problem:
`_prepare_moe_optimizer_params()` checked for `.transformer_1.` (with dots), but `lora_name` uses `$$` separators, so the check never matched and all params went into a single group instead of separate groups per expert.

Solution:
- Match on `transformer_1` without dots, so LoRA names like `transformer$$transformer_1$$blocks$$0$$attn1$$to_q` are routed to the correct expert group (see the sketch below)

Files changed:
- `toolkit/lora_special.py`
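A sketch of the corrected check (the function name is illustrative; the actual change lives in `toolkit/lora_special.py`):

```python
def belongs_to_transformer_1(lora_name: str) -> bool:
    # Buggy check: ".transformer_1." (with dots) never appears in the
    # "$$"-separated LoRA names, so it always returned False:
    #   return ".transformer_1." in lora_name
    # Fixed check: split on the "$$" separator and look for the segment.
    return "transformer_1" in lora_name.split("$$")

# Routed to the transformer_1 expert group:
assert belongs_to_transformer_1("transformer$$transformer_1$$blocks$$0$$attn1$$to_q")
# A name without the transformer_1 segment is not matched:
assert not belongs_to_transformer_1("transformer$$blocks$$0$$attn1$$to_q")
```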
3. WAN 2.2 I2V Boundary Detection Fix
Problem: The toolkit was hardcoded to use T2V boundary ratio (0.875) for all WAN 2.2 models, causing incorrect timestep distribution for I2V models.
Solution:
- Select the boundary ratio based on the model type (I2V vs. T2V) instead of always using the hardcoded T2V value (see the sketch below)
Files changed:
- `extensions_built_in/diffusion_models/wan22/wan22_14b_model.py`
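The shape of the fix, as a hedged sketch (the names and the 0.900 I2V value are assumptions drawn from the published WAN 2.2 configs, not a quote of `wan22_14b_model.py`):

```python
# WAN 2.2 MoE checkpoints hand off from the high-noise to the low-noise expert
# at a boundary timestep ratio. Previously the T2V value was hardcoded for
# every model; I2V models need their own boundary.
T2V_BOUNDARY_RATIO = 0.875
I2V_BOUNDARY_RATIO = 0.900  # assumed value from the WAN 2.2 I2V release config

def get_boundary_ratio(is_i2v: bool) -> float:
    return I2V_BOUNDARY_RATIO if is_i2v else T2V_BOUNDARY_RATIO
```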
4. AdamW8bit OOM Crash Fix
Problem: When an OOM occurs during training, the progress bar update attempts to access `loss_dict`, which hasn't been populated, causing a `KeyError` crash.

Solution:
- Skip the loss/progress-bar update when the step hit an OOM (guarded by `not did_oom`); see the sketch below

Files changed:
- `jobs/process/BaseSDTrainProcess.py`
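Roughly, the guard looks like this (a simplified sketch; `train_single_step`, `batch`, and `progress_bar` are hypothetical stand-ins for the real training-loop objects):

```python
import torch

did_oom = False
try:
    loss_dict = train_single_step(batch)  # hypothetical stand-in for the real step
except torch.cuda.OutOfMemoryError:
    did_oom = True
    torch.cuda.empty_cache()

if not did_oom:
    # loss_dict is only read when the step actually produced it, so the
    # progress bar update can no longer raise a KeyError after an OOM.
    progress_bar.set_postfix(loss_dict)
```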
5. Gradient Norm Logging
Problem: No visibility into gradient norms during training, making it difficult to diagnose divergence and LR issues.
Solution:
- Added a `_calculate_grad_norm()` method with comprehensive gradient tracking (sketched below)
- Report `grad_norm` in `loss_dict` alongside the loss

Files changed:
- `extensions_built_in/sd_trainer/SDTrainer.py`
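A compact version of the calculation (the toolkit's `_calculate_grad_norm()` may track more detail than this sketch):

```python
def calculate_grad_norm(parameters) -> float:
    # Global L2 norm over all parameter gradients present this step.
    total_sq = 0.0
    for p in parameters:
        if p.grad is not None:
            total_sq += p.grad.detach().float().norm(2).item() ** 2
    return total_sq ** 0.5

# Logged next to the loss, e.g. loss_dict["grad_norm"] = calculate_grad_norm(params)
```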
Features Included
1. Video-Friendly Bucket Resolutions ✨ NEW
Problem: Previous SDXL-oriented buckets caused excessive cropping for video content with common aspect ratios.
Solution:
- Added a `resolutions_video_1024` bucket set with video aspect ratios (16:9, 9:16, 4:3, 3:4); an illustrative example follows below
- Enabled per dataset via `use_video_buckets: true`

Benefits:
- Less cropping for video content with common aspect ratios
- Backward compatible: existing configs keep the previous behaviour (`use_video_buckets: false` by default)

Files changed:
- `toolkit/buckets.py`, `toolkit/data_loader.py`, `toolkit/dataloader_mixins.py`, `toolkit/config_modules.py`
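For illustration, a video-friendly bucket table at a roughly 1024×1024 pixel budget could look like this (these exact resolutions and the dict layout are assumptions, not necessarily the entries added to `toolkit/buckets.py`):

```python
# Illustrative values only: common video aspect ratios at roughly the same
# pixel budget as the existing 1024 photo buckets.
resolutions_video_1024 = [
    {"width": 1280, "height": 720},   # 16:9
    {"width": 720,  "height": 1280},  # 9:16
    {"width": 1152, "height": 864},   # 4:3
    {"width": 864,  "height": 1152},  # 3:4
]
```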
2. Pixel Budget Scaling ✨ NEW
Problem: Different aspect ratios used inconsistent resolutions, causing variable memory usage and suboptimal quality.
Solution:
- Added a `max_pixels_per_frame` parameter for memory-based scaling (see the sketch below)
- Example: `max_pixels_per_frame: 589824` (768×768) optimally scales all aspect ratios to the same pixel budget

Benefits:
- Consistent memory usage and quality across aspect ratios when `max_pixels_per_frame` is set
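A sketch of what the pixel budget does (illustrative math, not the exact dataloader routine; the snap-to-multiple-of-16 step is an assumption about typical video-model constraints):

```python
import math

def scale_to_pixel_budget(width: int, height: int, max_pixels_per_frame: int, step: int = 16):
    # Scale so width * height stays near the budget while preserving the
    # aspect ratio, then snap both sides to a multiple of `step`.
    scale = math.sqrt(max_pixels_per_frame / (width * height))
    new_w = round(width * scale / step) * step
    new_h = round(height * scale / step) * step
    return new_w, new_h

# 589824 = 768 * 768, so every aspect ratio lands on roughly the same budget:
print(scale_to_pixel_budget(1920, 1080, 589824))  # (1024, 576) for a 16:9 source
print(scale_to_pixel_budget(1024, 1024, 589824))  # (768, 768)  for a 1:1 source
```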
Feature Requests
UI/Config Enhancements
1. Automagic Optimizer Support
Request UI fields and validation for the automagic optimizer:
- `min_lr`, `max_lr`, `lr_bump`, starting `lr`

Benefit: Automagic is highly effective for WAN 2.2 training but currently requires manual YAML editing.
2. Network Dropout Settings
Add a UI field for the `network.dropout` parameter.

Benefit: Dropout helps prevent overfitting in LoRA training, which is especially important for small datasets.
3. More Custom Resolutions
Add more resolution presets: 256x256, 320x320, 384x384, 448x448, 512x512
Benefit: Different resolutions trade off detail, speed, and VRAM, so more presets make it easier to match training to the dataset and hardware.
4. Training Metrics & Graph Plotting
Add built-in metric tracking and visualization (e.g. loss, learning rate, and gradient norm over time).
Benefit: Currently users must manually parse logs and create graphs.
VRAM Optimization Requests
5. Single LoRA Training Mode for WAN 2.2
Add options to load only the HIGH-noise or only the LOW-noise model.
Benefit: Saves ~7-10GB VRAM by not loading the unused transformer.
6. Fix RAMTorch Implementation for WAN 2.2
RAMTorch currently doesn't work properly with the WAN 2.2 dual-transformer architecture.
Benefit: Would enable training on lower VRAM GPUs.
7. PyTorch Nightly + CUDA 13 Support (Blackwell)
Add optional requirements for PyTorch nightly, CUDA 13.x, and SM_120.
Benefit: Enables RTX 50-series GPU users to utilize new optimizations.
Testing
All fixes and features have been tested in production WAN 2.2 I2V LoRA training:
Results:
🤖 Generated with Claude Code
Co-Authored-By: Claude [email protected]